This notebook is the time series analysis portion of the project "To Everything There is a Season : Using Weather Data and Demographic Information in the Predictive Modeling of Crimes in Dallas, Texas" by Ashley Steele.
1. Importing Libraries and Setting Preferences
# Importing all important libraries here!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime
from dateutil.parser import parse
from fbprophet import Prophet
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Setting notebook preferences
pd.set_option('display.precision', 2)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
sns.set(style = 'whitegrid', font_scale = 1.25)
For this analysis we are going to use our final dataset as it has all of the information on weather and temperature that we need.
# Importing our final merged dataset to use for modeling
df = pd.read_csv('complete.csv')
df.head()
# Looking for duplicates, just in case
df['Unnamed: 0'].nunique()
len(df)
# Getting rid of our unnecessary columns
df.drop(columns = 'Unnamed: 0', inplace = True)
Now that we have our data loaded let's determine the main goal of our time series analysis and refine our data for modeling!
The main research question/hypothesis for our time series analysis is: how, if at all, does daily temperature affect daily crime reports in Dallas, Texas?
Since our main focus is on time-based features it is super important that we have all of our data set up correctly. Let's start working on this below!
df['date_only'] = pd.to_datetime(df['date_only'])
# Creating new columns for the datetime versions of these
df['mnth'] = df['date_only'].dt.month
df['day_of_year_number'] = df['date_only'].dt.dayofyear
# Sanity check: does everything look like we expect it to?
df.head()
# Getting rid of our old columns
df.drop(columns = ['month', 'day_of_the_year'], inplace = True)
# Just checking our overall info for the df again
df.info()
Before we can even begin to start thinking of modeling we have a few main issues to deal with, such as null values, distribution of specific features, and looking closely at how our features correlate (if at all) to our target variable.
# Where my nulls at?
df.isnull().sum().sort_values(ascending = False)
# Dropping nulls so we can model
df.dropna(inplace = True)
# Taking a quick look to see how much the length of our data changed by dropping nulls.
len(df)
# Dropping count features where we have a %
df.drop(columns = ['male', 'female', 'pop_over_16', 'zip_code.1'], inplace = True)
# Creating a daily count df to see if I can add it as a feature
daily_count = pd.DataFrame(df['date_only'].value_counts())
# What does our daily_count look like?
daily_count.sort_values(by = 'date_only', inplace = True)
daily_count.head(500)
# Cleaning up of new df
daily_count.reset_index(inplace = True)
daily_count.rename(columns = {'index':'date_only', 'date_only':'daily_crime_count'}, inplace = True)
daily_count.sort_values(by = 'date_only', inplace = True)
daily_count.head()
# Combining dataframes
df = pd.merge(df, daily_count, how = 'left', on= 'date_only' )
df.head(100)
Since we are doing this project as if we are working in an actual work situation where analyzing time series data is a daily part of our position, we are going to approach our modeling in that manner. In a real-life situation it is of the utmost importance to quickly 1.) see if the data we have is actually worth something in regards to modeling, and 2.) perform a quick and "dirty" linear regression using OLS to verify that our data is actually worth using. Let's get started with a correlation matrix for our data!
# How does everything relate to each other?
corr = df.corr().abs()
corr
# What do our correlations look like visually?
plt.figure(figsize = (20,15))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns);
AWESOME! The two main features we hypothesized were related, daily crime count and temperature, do show a correlation! Even though a 20% correlation between variables doesn't seem too impressive in a "perfect situation", since we are using real-life data this is a great sign that our two features are actually connected! Now that we've looked at the connection between the two, let's quickly take a look at what each feature looks like separately.
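As a side note, the same ranking can be read straight off the correlation matrix without the heatmap. A minimal sketch on made-up data (the column names mirror our df, but every number here is invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for our merged dataset (hypothetical values)
rng = np.random.default_rng(42)
temp = rng.uniform(30, 100, 200)
toy = pd.DataFrame({
    'temp_in_F': temp,
    # crime loosely tracks temperature, plus noise
    'daily_crime_count': 600 + 2 * temp + rng.normal(0, 50, 200),
    '%_unemployed': rng.uniform(2, 10, 200),  # unrelated on purpose
})

# Absolute correlations with the target, strongest first
ranked = (toy.corr().abs()['daily_crime_count']
             .drop('daily_crime_count')
             .sort_values(ascending=False))
print(ranked)
```

Sorting the absolute correlations like this puts the most promising candidate features at the top of the list in one line.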
# What is the distribution of our daily crime counts?
plt.figure(figsize = (20,10))
sns.distplot(df['daily_crime_count'])
plt.title('Distribution of Daily Crime Reports')
plt.xlabel('Daily Crime Count')
plt.ylabel('Occurrence');
# What is the distribution of our temperature?
plt.figure(figsize = (20, 10))
sns.distplot(df['temp_in_F'])
plt.title('Distribution of Temperature (F)')
plt.xlabel('Temperature in Fahrenheit')
plt.ylabel('Occurrence');
NICE! Both of our chosen features have a pretty regular distribution and no crazy outliers which means we can move on to the next step of our analysis.
In addition to our two main features we are focusing on for our modeling there were also several other correlations that stood out, though not enough to reach our threshold for modeling. Let's go ahead and visualize these relationships below!
# What does the relationship between crime counts and % of Black/African American residents look like?
plt.figure(figsize = (20,15))
sns.lineplot(x= df['%_black'], y = df['daily_crime_count'], data = df, color = 'red')
plt.title('Percentage of Black/African American Residents vs. Crime Count in Dallas, Texas 2015-2018')
plt.xlabel('% of Black/African American Residents')
plt.ylabel('Daily Crime Count');
What does this tell us about this relationship? There is a ton of activity around the 10-30% range, meaning that crime counts increase where the Black/African American population is between 10-30%. We can also see some interesting spikes and valleys throughout this data, but nothing is consistent enough to warrant a higher correlation.
# What does the relationship between crime counts and % of Native residents look like?
plt.figure(figsize = (20,15))
sns.lineplot(x= df['%_native'], y = df['daily_crime_count'], data = df, color = 'orange')
plt.title('Percentage of Native American Residents vs. Crime Count in Dallas, Texas 2015-2018')
plt.xlabel('% of Native American Residents')
plt.ylabel('Daily Crime Count');
What does this tell us about this relationship?
We see a ton of activity in the first part of this chart, which makes sense since there is a very small Native American population in Dallas. This graph is telling us that crime count is pretty high in smaller populations of Native Americans in Dallas.
# What does the relationship between crime counts and % of Asian residents look like?
plt.figure(figsize = (20,15))
sns.lineplot(x= df['%_asian'], y = df['daily_crime_count'], data = df, color = 'green')
plt.title('Percentage of Asian Residents vs. Crime Count in Dallas, Texas 2015-2018')
plt.xlabel('% of Asian Residents')
plt.ylabel('Daily Crime Count');
What does this tell us about this relationship?
According to our data it also appears that Dallas has a relatively small Asian population in specific areas. The interesting thing on this chart is the large downward spike around 4%, implying that crime counts drop drastically in areas where about 4% of the population is Asian.
# What does the relationship between crime counts and % of population over 16 years old look like?
plt.figure(figsize = (20,15))
sns.lineplot(x= df['%_pop_over_16'], y = df['daily_crime_count'], data = df, color = 'blue')
plt.title('Percentage of Population Over 16 vs. Crime Count in Dallas, Texas 2015-2018')
plt.xlabel('Percentage of Population Over 16')
plt.ylabel('Daily Crime Count');
What does this tell us about this relationship?
The ton of movement in the first part of this graph tells us that crime counts and population over 16 are most closely related when the population over 16 is smaller.
# What does the relationship between crime counts and % of population employed look like?
plt.figure(figsize = (20,15))
sns.lineplot(x= df['%_employed'], y = df['daily_crime_count'], data = df, color = 'purple')
plt.title('Percentage of Population Employed vs. Crime Count in Dallas, Texas 2015-2018')
plt.xlabel('Percentage of Population Employed')
plt.ylabel('Daily Crime Count');
What does this tell us about this relationship?
The greatest activity in crime counts occurs where 45-75% of the population is employed. This connection weakens the further out you go from this group in either direction.
# What does the relationship between crime counts and % of the population that is unemployed look like?
plt.figure(figsize = (20,15))
sns.lineplot(x= df['%_unemployed'], y = df['daily_crime_count'], data = df, color = 'red')
plt.title('Percentage of Population Unemployed vs. Crime Count in Dallas, Texas 2015-2018')
plt.xlabel('Percentage of Population Unemployed')
plt.ylabel('Daily Crime Count');
What does this tell us about this relationship?
Crime counts increase once more than 3% of the population is experiencing unemployment, but then level out when a larger portion (around 8%) of the population is unemployed.
# What does the relationship between crime counts and mean household income look like?
plt.figure(figsize = (20,15))
sns.lineplot(x= df['mean_household_income'], y = df['daily_crime_count'], data = df, color = 'orange')
plt.title('Mean Household Income vs. Crime Count in Dallas, Texas 2015-2018')
plt.xlabel('Mean Household Income')
plt.ylabel('Daily Crime Count');
What does this tell us about this relationship?
Areas with a mean household income between 100,000 and 200,000 USD experience the most drastic activity in relation to crime counts, while the number of crimes reported seems to fall after a mean household income of 250,000 USD.
# What does the relationship between crime counts and % of families in poverty look like?
plt.figure(figsize = (20,15))
sns.lineplot(x= df['%_families_poverty'], y = df['daily_crime_count'], data = df, color = 'yellow')
plt.title('Percentage of Families in Poverty vs. Crime Count in Dallas, Texas 2015-2018')
plt.xlabel('Percentage of Families in Poverty')
plt.ylabel('Daily Crime Count');
What does this tell us about this relationship?
This relationship is almost linear. We can definitely see an upward trend: as the percentage of families in poverty increases in an area, the number of reported crimes increases as well.
# What does the relationship between crime counts and year look like?
plt.figure(figsize = (20,15))
sns.lineplot(x= df['year'], y = df['daily_crime_count'], data = df, color = 'blue')
plt.title('Year vs. Crime Count in Dallas, Texas 2015-2018')
plt.xlabel('Year')
plt.ylabel('Daily Crime Count');
What does this tell us about this relationship?
We can see that the amount of reported crimes took a HUGE drop at the beginning of 2017 but began to increase shortly after.
As we discussed earlier, in real-life situations we won't have the time to do a full ARIMA model on all data we are given, and sometimes the data we will be asked to look at won't actually have any real value to our business. With that in mind, one quick way to see how important our features are to modeling is to use the OLS measures on each feature individually before looking at their combined scores. Since we are basing our model on only two features here, daily crime count and temperature, this is a pretty short process.
import statsmodels.api as sm
X = df['temp_in_F']
Y = df['daily_crime_count']
# Looking at OLS for temp vs. target
X = sm.add_constant(X)
results = sm.OLS(Y, X).fit()
results.summary()
We know that there are several important statistics in the summary above that can tell us whether our data is worth continuing with or not. First, we have our R Squared and Adjusted R Squared scores. Our R Squared score tells us how well our model explains the variance in our target. Since we are only considering one feature, our R Squared score is pretty low: temperature alone doesn't come close to explaining EVERYTHING that is happening in crime counts. (This is most likely because a single feature can only explain so much, and temperature doesn't really correlate with any other feature.) Another statistic that is important to whether our model will be worth our time is the p-value. The summary reports one for each coefficient; a p-value displayed as 0.000 simply means it is smaller than the table's precision, not that it is missing. We can simplify our dataframe and try this test again, just to see if there is a noticeable difference.
# Simplfying our data below
temp_crime_only = df[['date_only','daily_crime_count', 'temp_in_F']]
temp_crime_only = temp_crime_only.groupby('date_only').mean()
temp_crime_only.head()
# Running our OLS on our reduced data
X = temp_crime_only['temp_in_F']
Y = temp_crime_only['daily_crime_count']
# Looking at OLS for temp vs. target
X = sm.add_constant(X)
results = sm.OLS(Y, X).fit()
results.summary()
Our R Squared score has increased as a result of simplifying our dataset, and our p-value reads the same as before. For the purposes of this project we will keep our features as is and move on to setting up our dataframes for modeling.
Research question for our project: Can we use past records of crimes and the corresponding weather for the day of the crime to help predict future crimes?
Since our research question focuses mainly on number of crimes per day and weather let's simplify our testing datasets to reflect these features. Since time series data is best analyzed as a univariate dataset we will have to create a few sub-datasets focusing on the individual features we want to treat as time series data, work with them individually, and then compare them to our other time series data to draw our final conclusions. Let's get started!
# Since we've already created a daily crime count sub-df let's start there
daily_count.head()
# Changing the index of this df to date_only
daily_count.set_index('date_only', inplace = True)
# Sanity check: did our index actually reset?
daily_count.head()
# Let's take a quick look at this sub-df and its details
daily_count.describe()
Now that we have our daily crime count and date as a separate data frame we can move on to our date vs. temperature data frame.
# Creating our initial data frame for date and temperature
temp_df = df[['date_only', 'temp_in_F']].copy()
# Checking that it looks ok
temp_df.head()
# Smushing these down to have one date and one temp each
temp_df = temp_df.groupby('date_only').mean()
# Sanity check: did the smushing work correctly?
temp_df.head()
# Let's look at the overall info for this dataset
temp_df.describe()
Excellent! We now have nice, neat sub-dataframes that we can now use to do our time series analysis modeling with! Next step: decomposing these!
Although, at first glance, time series data appears to just be one set of data, there are really four main components that must be dealt with separately in order to process time series data correctly. These four parts are: trend, seasonality, noise/error, and the observed (baseline) data. Below we will decompose our time series datasets into their respective parts in order to better understand what is really happening in our data!
# Plotting our dataset to see what it looks like
plt.figure(figsize = (20,10))
plt.plot(daily_count)
plt.title('Date vs. Crime Count: Raw Data')
plt.xlabel('Date')
plt.ylabel('Daily Crime Count');
Looking at the visualization above we can see that our data appears to have some seasonality (there are repeated spikes and valleys in the line), there is some sort of trend (our line almost has a "typical" wave look to it- it follows several up and down curves), and there seems to be some weird data happening around the beginning of 2018. Let's first take a closer look at that weird data before we decompose this data.
# What days did we have less than 20 crime reports?
daily_count.loc[daily_count['daily_crime_count']< 20]
Intuitively, we know that it is very unlikely that the Dallas Police Department, which serves a population of 1.3 million people daily, had several days with fewer than 20 reported crimes, so let's go ahead and drop these days.
daily_count= daily_count.loc[daily_count['daily_crime_count']> 20]
# Double checking, visually, that our data looks more regular now
plt.figure(figsize = (20,10))
plt.plot(daily_count)
plt.title('Date vs. Crime Count: Raw Data')
plt.xlabel('Date')
plt.ylabel('Daily Crime Count');
That looks so much better! Now that our data is more solid we can begin to separate out the components of our time series.
# To do this we need to import the correct tools
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(daily_count, freq = 365)  # note: newer statsmodels versions call this argument period=
# Getting the values for our parts
trend_crime = decomposition.trend
seasonal_crime = decomposition.seasonal
noise_crime = decomposition.resid
observed_crime = decomposition.observed
# Plotting out what these parts look like visually
# Plotting original data
plt.figure(figsize= (15,20))
plt.subplot(411)
plt.plot(daily_count, label = 'Original Data')
plt.legend(loc = 'best')
plt.title('Original Date vs. Crime Count')
# Plotting Trend
plt.subplot(412)
plt.plot(trend_crime, 'g', label = 'Trend')
plt.legend(loc = 'best')
plt.title('Trend')
# Plotting Seasonality
plt.subplot(413)
plt.plot(seasonal_crime,'y', label = 'Seasonality')
plt.legend(loc = 'best')
plt.title('Seasonality')
# Plotting Noise
plt.subplot(414)
plt.plot(noise_crime, 'r', label = 'Noise')
plt.legend(loc = 'best')
plt.title('Noise');
While it is SUPER COOL to see our data "exploded" like this, we still have a few more steps to do before we can actually model and forecast this bad boy. Let's check the stationarity (checking to make sure the rolling average and standard deviation don't change over time).
There are two main ways we can check the stationarity of our data: looking at our rolling statistics and performing a "Dickey-Fuller" test. Since we are just baby data scientists and time lords let's go ahead and try both out to see what happens!
# Determining our rolling statistics
rol_mean = daily_count.rolling(12).mean()
rol_std = daily_count.rolling(12).std()
# What do these look like visually?
plt.figure(figsize = (20, 15))
original = plt.plot(daily_count, color = 'blue', label = 'Original')
mean = plt.plot(rol_mean, color = 'green', label = 'Rolling Mean')
std = plt.plot(rol_std, color = 'purple', label = 'Rolling STD')
plt.legend(loc = 'best')
plt.title('Moving Mean and STD')
plt.show(block = False)
# Trying out the Dickey-Fuller test
from statsmodels.tsa.stattools import adfuller
# Performing our Dickey-Fuller test
test = adfuller(daily_count['daily_crime_count'], autolag = 'AIC')
print(test)
# Translating Our Results
dfoutput = pd.Series(test[0:4], index = ['Test Statistic', 'p-value', '# Lags Used', 'Number of Observations Used'])
print('Results of Dickey-Fuller Test: ')
for key, value in test[4].items():
    dfoutput['Critical Value (%s)' %key] = value
print(dfoutput, '\n')
For a Dickey-Fuller test there are two main areas we pay attention to: the test statistic and the critical value. We know our time series is stationary if our test statistic is smaller than our critical value. In our test above, since our numbers are on the negative scale, we see that our critical value (-3.43) is larger (closer to zero = bigger) than our test statistic (-4.10), so we can tell right away that our date vs. crime count time series is stationary! Let's move on to our second time series!
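That comparison rule is easy to encode in a tiny helper. A sketch: it just checks whether the test statistic is more negative than the chosen critical value, using the kind of dict `adfuller` returns as its fifth element (the -4.10 and critical values are the numbers quoted above; the second call uses a hypothetical borderline series):

```python
def looks_stationary(test_statistic, critical_values, level='5%'):
    """True if the ADF test statistic beats the critical value at `level`.

    `critical_values` has the shape of adfuller's critical-value dict,
    e.g. {'1%': -3.43, '5%': -2.86, '10%': -2.57}.
    """
    return test_statistic < critical_values[level]

crit = {'1%': -3.43, '5%': -2.86, '10%': -2.57}
print(looks_stationary(-4.10, crit, level='1%'))  # True: our crime-count statistic
print(looks_stationary(-2.00, crit, level='5%'))  # False: a hypothetical weaker result
```

Wrapping the rule this way keeps the "more negative beats the critical value" logic in one place instead of re-reading the output by eye each time.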
# Plotting our dataset to see what it looks like
plt.figure(figsize = (20,10))
plt.plot(temp_df)
plt.title('Date vs. Temperature in F: Raw Data')
plt.xlabel('Date')
plt.ylabel('Temperature (F)');
Overall this data doesn't look too bad! We have a weird outlier in 2016, but otherwise it follows a similar pattern to the other years, so we can move on to decomposing this data!
# Setting up the decomposition
decomposition = seasonal_decompose(temp_df, freq = 365)
# Getting the values for our parts
trend_temp = decomposition.trend
seasonal_temp = decomposition.seasonal
noise_temp = decomposition.resid
# Plotting out what these parts look like visually
# Plotting original data
plt.figure(figsize= (15,20))
plt.subplot(411)
plt.plot(temp_df, label = 'Original Data')
plt.legend(loc = 'best')
plt.title('Original Date vs. Temperature')
# Plotting Trend
plt.subplot(412)
plt.plot(trend_temp, 'g', label = 'Trend')
plt.legend(loc = 'best')
plt.title('Trend')
# Plotting Seasonality
plt.subplot(413)
plt.plot(seasonal_temp,'y', label = 'Seasonality')
plt.legend(loc = 'best')
plt.title('Seasonality')
# Plotting Noise
plt.subplot(414)
plt.plot(noise_temp, 'r', label = 'Noise')
plt.legend(loc = 'best')
plt.title('Noise');
Just like we did above we need to test if this data is stationary or not.
# Determining our rolling statistics
rol_mean = temp_df.rolling(12).mean()
rol_std = temp_df.rolling(12).std()
# What do these look like visually?
plt.figure(figsize = (20, 15))
original = plt.plot(temp_df, color = 'blue', label = 'Original')
mean = plt.plot(rol_mean, color = 'green', label = 'Rolling Mean')
std = plt.plot(rol_std, color = 'purple', label = 'Rolling STD')
plt.legend(loc = 'best')
plt.title('Moving Mean and STD')
plt.show(block = False)
# Performing our Dickey-Fuller test
test = adfuller(temp_df['temp_in_F'], autolag = 'AIC')
# Translating these results into an easier to read format
dfoutput = pd.Series(test[0:4], index = ['Test Statistic', 'p-value', '# Lags Used', 'Number of Observations Used'])
print('Results of Dickey-Fuller Test: ')
for key, value in test[4].items():
    dfoutput['Critical Value (%s)' %key] = value
print(dfoutput, '\n')
As stated above, our goal in showing that our time series is stationary is to have a test statistic that is smaller than our critical value. Here our test statistic (-2.73) is smaller than the critical value at 10% (-2.57), so our time series is stationary at the 10% significance level (though not at 5% or 1%). Since both of our time series datasets check out, we can use the data as it is and finally move on to the modeling portion of our time series analysis.
Time series analysis, in most forms, is a univariate analysis. Since we are using the AMAZING Prophet package created by Facebook to help simplify and streamline additive regression modeling of time series data, we are able to perform both uni- and multivariate models on our data. Let's start off simply with our univariate models.
Before we can do anything else let's go ahead and make the new sub-data frames we will use for our models!
# Editing our daily count info for prophet
daily_count.reset_index(inplace = True)
daily_count.rename(columns = {'date_only':'ds', 'daily_crime_count':'y'}, inplace = True)
daily_count.head()
# Setting up our temperature info to work with Prophet
temp_df.reset_index(inplace = True)
temp_df.rename(columns = {'date_only':'ds', 'temp_in_F':'y'}, inplace = True)
temp_df.head()
Since we already have actual data for the majority of 2019 we can edit it now so we can compare it to our predicted values after modeling!
# Importing our 2019 data here
df2019 = pd.read_csv('df2019.csv')
# Checking to see what this data looks like
df2019.head()
# Dropping any null values!
df2019.dropna(inplace = True)
# Creating our date vs. crime count subset
crime2019 = pd.DataFrame(df2019['date_only'].value_counts())
# Standardizing our new sub-df and taking a quick peek!
crime2019.rename(columns = {'date_only':'count'}, inplace = True)
crime2019.reset_index(inplace = True)
crime2019['index'] = pd.to_datetime(crime2019['index'])
crime2019.set_index('index', inplace = True)
crime2019.sort_index(inplace= True)
crime2019.head()
# Slightly modifying our 2019 crime count list so we can use it to merge with our total crimes
crime_2019 = crime2019.copy()
crime_2019.reset_index(inplace = True)
crime_2019.rename(columns = {'index':'ds', 'count':'y'}, inplace = True)
crime_2019['ds']= pd.to_datetime(crime_2019['ds'])
crime_2019.head()
# Adding our 2019 data to our total data
total_crime = pd.concat([daily_count, crime_2019], axis = 0)
# What does our new df look like?
total_crime.head()
Excellent! Now that we have our date vs. crime count data let's move on to our date vs. temperature data.
# Creating our date vs. temperature data frame for 2019
temp2019 = pd.DataFrame(df2019[['date_only', 'temp_in_F']])
# Merging our temps down by day & prepping for Prophet
temp2019 = temp2019.groupby('date_only').mean()
temp2019.reset_index(inplace= True)
temp2019['date_only'] = pd.to_datetime(temp2019['date_only'])
temp2019.set_index('date_only', inplace = True)
temp2019.head()
# Making another sub-df to help us in merging with total temperature data
temp_2019 = temp2019.copy()
temp_2019.reset_index(inplace= True)
temp_2019.rename(columns = {'date_only':'ds', 'temp_in_F':'y'}, inplace= True)
#Converting our date to datetime
temp_2019['ds']= pd.to_datetime(temp_2019['ds'])
temp_2019.head()
# Creating a master temp df
total_temp = pd.concat([temp_df, temp_2019], axis = 0)
total_temp.tail()
#Adding our prior year dataframes together
all_old = pd.merge(daily_count, temp_df, how='inner', on = 'ds')
all_old.rename(columns = {'y_x':'y', 'y_y':'temp'}, inplace = True)
all_old.head()
crime_2019.head()
# Adding all of our 2019 data together
all_new = pd.merge(crime_2019, temp_2019, how= 'inner', on = 'ds')
all_new.rename(columns = {'y_x':'y', 'y_y':'temp'}, inplace = True)
all_new.head()
Now that all of our data is neatly combined we can split our final data into test and training sets.
# Creating a split of crimes for testing and training - crime
total_crime['year'] = total_crime['ds'].dt.year
train_crime = total_crime.loc[total_crime['year'] <= 2017].copy()
test_crime = total_crime.loc[total_crime['year'] > 2017].copy()
# Drop the years column and finish
train_crime.drop(columns= 'year', inplace = True)
test_crime.drop(columns = 'year', inplace = True)
total_crime.drop(columns = 'year', inplace = True)
# Creating a split of crimes for testing and training - temp
total_temp['year'] = total_temp['ds'].dt.year
train_temp = total_temp.loc[total_temp['year'] <= 2017].copy()
test_temp = total_temp.loc[total_temp['year'] > 2017].copy()
train_temp.drop(columns = 'year', inplace = True)
test_temp.drop(columns = 'year', inplace = True)
total_temp.drop(columns = 'year', inplace = True)
Hooray! We now have training and testing datasets for data! Let's move on to modeling!
Now that we have all of our data (and sub-datasets) set up we can begin with our univariate modeling. Since we are using the Prophet package to assist in our modeling we are limited with the amount of tuning and refining we can do in our modeling. Most of what we do to refine our models will be on the front end with how we set our data up, split our data, and what sub-sets to use when.
Important note:
Above we have two separate sets of data for our different approaches to modeling. First, we have the actual collected data (daily counts from 2015 to 2018 and data from 2019 only) in two separate dataframes. Additionally, we have a user-created split of the overall data (combined data from 2015-2017 and combined data from 2018-2019). Due to the lack of options for model tuning we will try out both datasets and approaches to see which works best!
# Setting up our model in Prophet - Original Crime Data As-Is
model_crime = Prophet(daily_seasonality = True)
model_crime.fit(daily_count)
# Fixing for model
crime_2019.reset_index(inplace = True)
# Forecasting with our actual 2019 data
crime_test_fcst = model_crime.predict(df=crime_2019)
# Plot the forecast
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
fig = model_crime.plot(crime_test_fcst, ax=ax)
plt.title('Actual and Predicted Crimes (2015 - 2019)- Actual Data')
plt.xlabel('Year')
plt.ylabel('Crime Counts');
# Plot the components (exploded view)
fig = model_crime.plot_components(crime_test_fcst)
plt.tight_layout()
# Run test statistics
mse_crime_a = mean_squared_error(y_true=crime_2019['y'],y_pred=crime_test_fcst['yhat'])
mae_crime_a = mean_absolute_error(y_true=crime_2019['y'],y_pred=crime_test_fcst['yhat'])
def mean_absolute_percentage_error(y_true, y_pred):
    """Calculates MAPE given y_true and y_pred"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape_crime_a = mean_absolute_percentage_error(y_true=crime_2019['y'],y_pred=crime_test_fcst['yhat'])
We will review what these statistics mean to our models below!
# Setting up our model in Prophet - crime training
model_crime = Prophet(daily_seasonality = True)
model_crime.fit(train_crime)
# Using our test data to make predictions
crime_test_fcstb = model_crime.predict(df=test_crime)
# Plot the forecast
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
fig = model_crime.plot(crime_test_fcstb, ax=ax)
plt.title('Actual and Predicted Crime Count (2015 - 2019)- Self Split')
plt.xlabel('Year')
plt.ylabel('Crime Count');
# What do the individual components of this forecast look like?
fig = model_crime.plot_components(crime_test_fcstb)
plt.tight_layout()
Let's now look at what our actual data looks like versus our predictions!
# Making a quick adjustment for modeling!
crime_for1 = crime_test_fcst.copy()
crime_for2 = crime_test_fcstb.copy()
crime_for2.set_index('ds', inplace = True)
crime_for1.set_index('ds', inplace = True)
# Another quick adjustment for modeling
crime_test = test_crime.copy()
crime_test.set_index('ds', inplace = True)
crime_test.head()
# What do predicted 2018 & 2019 crimes look like vs. actual?
plt.figure(figsize= (20,10))
# Plotting our actual data
plt.subplot(211)
plt.plot(crime_test['y'], color= 'green')
plt.title('Actual Crime Report Counts in Dallas, Texas: 2018-2019')
plt.xlabel('Date')
plt.ylabel('Counts');
# Plotting predicted crimes
plt.subplot(212)
plt.plot(crime_for2['yhat'], color = 'red')
plt.title('Predicted Crime Report Counts in Dallas, Texas: 2018-2019')
plt.xlabel('Date')
plt.ylabel('Counts');
plt.tight_layout()
# What do predicted 2019 crimes look like vs. actual?
plt.figure(figsize= (20,10))
# Plotting our actual data
plt.subplot(211)
plt.plot(crime2019['count'], color= 'green')
plt.title('Actual Crime Report Counts in Dallas, Texas: 2019')
plt.xlabel('Date')
plt.ylabel('Counts');
# Plotting predicted crimes
plt.subplot(212)
plt.plot(crime_for1['yhat'], color = 'red')
plt.title('Predicted Crime Report Counts in Dallas, Texas: 2019')
plt.xlabel('Date')
plt.ylabel('Counts');
plt.tight_layout()
# Run test statistics
mse_crime_b = mean_squared_error(y_true= test_crime['y'],y_pred= crime_test_fcstb['yhat'])
mae_crime_b = mean_absolute_error(y_true= test_crime['y'],y_pred= crime_test_fcstb['yhat'])
def mean_absolute_percentage_error(y_true, y_pred):
    """Calculates MAPE given y_true and y_pred"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape_crime_b = mean_absolute_percentage_error(y_true= test_crime['y'],y_pred=crime_test_fcstb['yhat'])
print('Crime Test Statistics (2019 Data Only):')
print('The mean squared error is: ', mse_crime_a)
print('The mean absolute error is: ', mae_crime_a)
print('The mean absolute percentage error is :', mape_crime_a)
print('\n')
print('Crime Test Statistics (2018-2019 Data):')
print('The mean squared error is: ', mse_crime_b)
print('The mean absolute error is: ', mae_crime_b)
print('The mean absolute percentage error is :', mape_crime_b)
We can see several things in our test statistics. Let's start with our mean squared error. We know that the closer a mean squared error (MSE) is to zero the better the model is, and the model using our real data divisions did the best since it has the lower MSE score. Our mean absolute percentage error (MAPE) measures how far off our predictions are, on average, as a percentage of the actual values. As with MSE, we want this number to be lower since that tells us there is less error.
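To make the scales concrete, here is what the three metrics look like on a made-up actual/predicted pair (toy numbers, not our forecasts):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 120.0, 90.0, 110.0])  # pretend actual daily counts
y_pred = np.array([105.0, 115.0, 95.0, 100.0])  # pretend forecasts

mse = mean_squared_error(y_true, y_pred)                  # squares punish big misses
mae = mean_absolute_error(y_true, y_pred)                 # average miss, in counts
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # same formula as our helper

print(mse, mae, mape)
```

Note how the single 10-count miss dominates the MSE (via squaring) while only nudging the MAE and MAPE, which is why the two tell slightly different stories about the same model.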
Based on our statistics above it would be best for us to continue to use our originally split data as our test and train sets and disregard our self-made split data. Let's move on to temperature!
# Creating our model for temperature data
model_temp = Prophet(daily_seasonality = True)
model_temp.fit(temp_df)
# Temp. fix for model
temp_2019.reset_index(inplace = True)
# Making our set of predictions using our actual 2019 data
temp_test_fcsta = model_temp.predict(df = temp_2019)
# What does this look like visually?
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
fig = model_temp.plot(temp_test_fcsta, ax=ax)
plt.title('Actual and Predicted Temperatures for Dallas, Texas (2015-2019)- 2019 Testing Only')
plt.xlabel('Year')
plt.ylabel('Temperature (F)');
# Plot the components- original data
fig = model_temp.plot_components(temp_test_fcsta)
plt.tight_layout()
# Test statistics, yo!
mse_temp_a = mean_squared_error(y_true= temp_2019['y'],y_pred= temp_test_fcsta['yhat'])
mae_temp_a = mean_absolute_error(y_true= temp_2019['y'],y_pred= temp_test_fcsta['yhat'])
def mean_absolute_percentage_error(y_true, y_pred):
    """Calculates MAPE given y_true and y_pred"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape_temp_a = mean_absolute_percentage_error(y_true= temp_2019['y'],y_pred= temp_test_fcsta['yhat'])
Look below for our statistics and a breakdown of what they mean!
# Modeling our temperature data with our self-created split sets
model_temp = Prophet(daily_seasonality = True)
model_temp.fit(train_temp)
# Making another set of predictions using our test data
temp_test_fcstb = model_temp.predict(df = test_temp)
# Plot the forecast- Self made split data
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
fig = model_temp.plot(temp_test_fcstb, ax=ax)
plt.title('Actual and Predicted Temperature Dallas, Texas (2015-2019)- 2018-2019 Testing')
plt.xlabel('Date')
plt.ylabel('Temperature (F)')
# Plot the components- Self-Split data
fig = model_temp.plot_components(temp_test_fcstb)
plt.tight_layout()
# Making a quick adjustment
plot_test_temp = test_temp.copy()
plot_test_temp.set_index('ds', inplace = True)
plot_test_temp.head()
# What does predicted 2018 & 2019 temperature look like vs. actual?
plt.figure(figsize= (20,10))
# Plotting our actual data
plt.subplot(211)
plt.plot(plot_test_temp['y'], color = 'green')
plt.title('Actual Temperature in Dallas, Texas: 2018-2019')
plt.xlabel('Date')
plt.ylabel('Temperature (F)');
# Plotting predicted temps
plt.subplot(212)
plt.plot(temp_test_fcstb['yhat'], color = 'red')
plt.title('Predicted Temperature in Dallas, Texas: 2018-2019')
plt.xlabel('Date')
plt.ylabel('Temperature (F)');
plt.tight_layout()
# What does predicted 2019 temperature look like vs. actual?
plt.figure(figsize= (20,10))
# Plotting our actual data
plt.subplot(211)
plt.plot(temp2019['temp_in_F'], color= 'green')
plt.title('Actual Temperature in Dallas, Texas: 2019')
plt.xlabel('Date')
plt.ylabel('Temperature (F)');
# Plotting predicted temps
plt.subplot(212)
plt.plot(temp_test_fcsta['yhat'], color = 'red')
plt.title('Predicted Temperature in Dallas, Texas: 2019')
plt.xlabel('Date')
plt.ylabel('Temperature (F)');
plt.tight_layout()
# Test statistics, yo!
mse_temp_b = mean_squared_error(y_true= test_temp['y'],y_pred= temp_test_fcstb['yhat'])
mae_temp_b = mean_absolute_error(y_true= test_temp['y'],y_pred= temp_test_fcstb['yhat'])
def mean_absolute_percentage_error(y_true, y_pred):
    """Calculates MAPE given y_true and y_pred"""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape_temp_b = mean_absolute_percentage_error(y_true= test_temp['y'],y_pred= temp_test_fcstb['yhat'])
# Test statistics go here
print('Test Statistics Temperature (2019 Data Only):')
print('The mean squared error is: ', mse_temp_a)
print('The mean absolute error is: ', mae_temp_a)
print('The mean absolute percentage error is :', mape_temp_a)
print('\n')
print('Test Statistics Temperature (2018-2019 Data):')
print('The mean squared error is: ', mse_temp_b)
print('The mean absolute error is: ', mae_temp_b)
print('The mean absolute percentage error is :', mape_temp_b)
Just as we saw with our crime predictions, our error scores are lower when we use our original dataset for training and our additional data as a testing set. We will keep that model going forward!
Unlike other autoregressive models used in time series analysis, Prophet boasts the ability to add regressors, making it able to perform multivariate analysis on time series data. We will attempt to do just that below!
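Prophet expects an extra regressor to appear as a named column alongside `ds` and `y` in the training frame, and again in the frame passed to `predict`. Below is a minimal sketch of that input shape with hypothetical values; our real `all_old` and `all_new` frames are built the same way from the merged dataset:

```python
import pandas as pd

# Training frame: Prophet's required 'ds'/'y' columns plus the regressor column
all_old_sketch = pd.DataFrame({
    'ds': pd.date_range('2015-01-01', periods=4, freq='D'),
    'y': [250, 231, 248, 260],          # daily crime report counts (hypothetical)
    'temp': [41.0, 38.5, 44.2, 47.0],   # daily temperature regressor (hypothetical)
})

# The frame handed to predict() must carry the same regressor column
all_new_sketch = pd.DataFrame({
    'ds': pd.date_range('2019-01-01', periods=2, freq='D'),
    'temp': [39.1, 42.8],
})

# With fbprophet installed, the regressor is registered before fitting:
# m = Prophet(daily_seasonality=True)
# m.add_regressor('temp')
# m.fit(all_old_sketch)
# forecast = m.predict(all_new_sketch)
print(all_old_sketch.columns.tolist())
```

If the regressor column is missing from the prediction frame, Prophet raises an error, so the column name has to match exactly in both frames.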
# Making a model with both features combined (since our original vs. new data split was so successful, let's keep that)
m = Prophet(daily_seasonality = True)
m.add_regressor('temp')
m.fit(all_old)
# Making our predictions using our testing set
forecast_all = m.predict(df = all_new)
forecast_all.head()
# Plot the components
fig = m.plot_components(forecast_all)
plt.title('Exploded Components: Forecast of All Features Together')
plt.tight_layout()
# Cleaning up for plotting
combo_forecast = forecast_all.copy()
combo_forecast.set_index('ds', inplace = True)
temp_2019.set_index('ds', inplace = True)
crime_2019.set_index('ds', inplace = True)
# Plotting our forecast vs. actuals
plt.figure(figsize = (20,10))
plt.plot(combo_forecast['yhat'], color= 'red', label = "Combined Forecast")
plt.plot(crime_2019['y'], color = 'green', label = 'Actual Crime Counts')
plt.plot(temp_2019['y'], color= 'blue', label = 'Actual Temperatures')
plt.legend(loc = 'best')
plt.title('Combined Forecast vs. Actual Crime Count and Temperature')
plt.xlabel('Date');
# Test statistics, yo!
mse_final = mean_squared_error(y_true= all_new['y'],y_pred= forecast_all['yhat'])
mae_final = mean_absolute_error(y_true= all_new['y'],y_pred= forecast_all['yhat'])
mape_final = mean_absolute_percentage_error(y_true= all_new['y'],y_pred= forecast_all['yhat'])
print('Test Statistics- Multivariate Model:')
print('The mean squared error is: ', mse_final)
print('The mean absolute error is: ', mae_final)
print('The mean absolute percentage error is :', mape_final)
Overall, our errors aren't too high, which is great! Unfortunately, without combining all of the data and splitting it into testing groups based on our own criteria, we aren't able to test this model again with other parameters, and we don't really know how well our model performs compared to other models. Prophet has an additional capability to cross-validate models, so let's try that below to see if we can tune our model any further!
# Importing our needed tools
from fbprophet.diagnostics import cross_validation
df_cv = cross_validation(m, initial='730 days', period='180 days', horizon = '365 days')
df_cv.head()
# Let's look at performance metrics for this
from fbprophet.diagnostics import performance_metrics
df_p = performance_metrics(df_cv)
df_p.head()
# What do statistics look like the further out you go in time?
df_p.tail()
# Let's plot what this all looks like
from fbprophet.plot import plot_cross_validation_metric
fig = plot_cross_validation_metric(df_cv, metric='mape')
Cross validation in Prophet is, thankfully, pretty much the same as any other cross validation. We use cross validation to help us tune our model and help it make more accurate predictions. In the image above we can see our MAPE for each individual prediction as a dot and the line shows our MAPE for the entire model. Our main takeaway from our cross validation is: the longer out in time we go for predictions the less accurate our model is. This is why our initial test models with split data were less accurate than our original data vs. brand new data.
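To see the horizon effect directly, we can recompute MAPE by horizon from the cross-validation frame ourselves. `df_cv` holds `ds`, `cutoff`, `y`, and `yhat` columns, and grouping the absolute percentage error by `ds - cutoff` is roughly what `performance_metrics` reports (it also applies a rolling window). Here is a sketch on a tiny hand-made stand-in frame with hypothetical values:

```python
import pandas as pd
import numpy as np

# Tiny stand-in for Prophet's df_cv output (hypothetical values)
df_cv_sketch = pd.DataFrame({
    'ds': pd.to_datetime(['2019-01-02', '2019-01-08', '2019-07-02', '2019-07-08']),
    'cutoff': pd.to_datetime(['2019-01-01', '2019-01-01', '2019-07-01', '2019-07-01']),
    'y': [200.0, 210.0, 190.0, 205.0],
    'yhat': [195.0, 230.0, 188.0, 170.0],
})

# Horizon = how far past the training cutoff each prediction reaches
df_cv_sketch['horizon'] = df_cv_sketch['ds'] - df_cv_sketch['cutoff']
df_cv_sketch['ape'] = np.abs(df_cv_sketch['y'] - df_cv_sketch['yhat']) / df_cv_sketch['y']

# Mean absolute percentage error per horizon, in percent
mape_by_horizon = df_cv_sketch.groupby('horizon')['ape'].mean() * 100
print(mape_by_horizon)
```

In this toy frame the 7-day horizon shows a much larger MAPE than the 1-day horizon, mirroring the upward drift we see in the plotted cross-validation metric above.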
Time series modeling is a mythical creature in the world of data science. It is very difficult to learn how to deal with this type of data because very few resources explain how to manipulate and interpret time series data without requiring a master's in statistics or econometrics. In the future I would like to continue working on this project and refine my models as I learn more about this specialization.
Want to know what I did with this data? Check out the project page here